[Figure 2.5: bar chart of ImageNet top-1 accuracy for DeiT-Small and DeiT-Base when quantizing the MLP, the weights in MHSA, the activations in MHSA, or the full model to 2 bits.]
FIGURE 2.5
Analysis of bottlenecks from an architecture perspective. We report the accuracy of 2-bit quantized DeiT-S and DeiT-B on the ImageNet dataset when each full-precision structure is replaced by its quantized counterpart.
only a drop of 1.78% and 4.26%, respectively. Once the query, key, value, and attention weights are quantized, however, the performance drop (10.57%) remains significant even when all weights of the linear layers in the MHSA module are kept in full precision. Thus, improving the attention structure is critical to solving the performance drop problem of quantized ViT.
Optimization bottleneck. We calculate the ℓ2-norm distances between the attention weights of different blocks of the DeiT-S architecture, as shown in Fig. 2.6. The MHSA modules at different depths of the full-precision ViT learn different representations from images. As mentioned in [197], lower ViT layers attend to representations both locally and globally. However, the fully quantized ViT (blue lines in Fig. 2.6) fails to reproduce these distances between attention maps. Therefore, a new design is required to make better use of the full-precision teacher information.
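As a rough sketch (not the code used for Fig. 2.6), the cross-block distance analysis can be reproduced along the following lines in PyTorch; the attention maps below are random stand-ins for maps that would be hooked out of DeiT-S, and attention_distance_matrix is a name of our own.

# Sketch of the cross-block attention-distance analysis (our illustration).
import torch

def attention_distance_matrix(attn_maps):
    """attn_maps: list of [batch, heads, tokens, tokens] tensors, one per block.
    Returns a [blocks, blocks] matrix of mean pairwise l2 distances."""
    num_blocks = len(attn_maps)
    dist = torch.zeros(num_blocks, num_blocks)
    for i in range(num_blocks):
        for j in range(num_blocks):
            diff = attn_maps[i] - attn_maps[j]                  # [B, H, N, N]
            # l2 norm of each attention-map difference, averaged over batch and heads
            dist[i, j] = diff.flatten(2).norm(dim=-1).mean()
    return dist

# Stand-in attention maps for a 12-block, 6-head DeiT-S-like model with 197 tokens.
torch.manual_seed(0)
maps = [torch.softmax(torch.randn(2, 6, 197, 197), dim=-1) for _ in range(12)]
print(attention_distance_matrix(maps))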
2.3.3 Information Rectification in Q-Attention
To address the information distortion of quantized representations in forward propagation, we propose an efficient Q-Attention structure based on information theory, which statistically maximizes the entropy of the representation and revives the attention mechanism in the fully quantized ViT. Since representations with an extremely compressed bit width in the fully quantized ViT have limited capacity, the ideal quantized representation should preserve its full-precision counterpart as much as possible, which means that the mutual information between the quantized and full-precision representations should be maximized, as mentioned in [195].
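To make the entropy argument concrete, the toy sketch below (our illustration, not part of [195]) quantizes a stand-in activation tensor with a 2-bit uniform quantizer and measures the empirical entropy of the resulting codes: a well-centered input spreads over all four codes (close to 2 bits), while a badly shifted input collapses into the clipping levels and thus carries little information about its full-precision counterpart.

# Toy entropy check for a 2-bit uniform quantizer (our sketch, stand-in data).
import torch

def quantize_2bit(x, scale):
    """Uniform 2-bit quantizer with codes in {-2, -1, 0, 1}."""
    return torch.clamp(torch.round(x / scale), -2, 1)

def code_entropy(codes):
    """Empirical entropy (in bits) of the discrete quantization codes."""
    _, counts = torch.unique(codes, return_counts=True)
    p = counts.float() / codes.numel()
    return -(p * p.log2()).sum().item()

torch.manual_seed(0)
x = torch.randn(100_000)                                 # stand-in full-precision activations
print(code_entropy(quantize_2bit(x, scale=1.0)))         # well centered: about 1.8 of 2 bits
print(code_entropy(quantize_2bit(x + 5.0, scale=1.0)))   # badly shifted: near 0 bits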
We further show statistically that the query and key distributions in ViT architectures tend to follow Gaussian distributions under distillation supervision, whose histograms are bell-shaped [195]. For example, Fig. 2.3 and Fig. 2.7 show the query and key distributions and their corresponding probability density functions (PDFs), computed using the mean and standard deviation of each MHSA layer. Therefore, the query and key distributions in the MHSA modules of the full-precision counterparts are formulated as follows:
q ∼ N(μ(q), σ(q)),   k ∼ N(μ(k), σ(k)).        (2.18)
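This assumption can be checked empirically with the short sketch below (a random stand-in query tensor and our own helper gaussian_pdf, rather than activations captured from DeiT): estimate μ(q) and σ(q), and compare the resulting Gaussian PDF with the normalized histogram of q.

# Sketch of the Gaussian fit behind Eq. (2.18) (our illustration, stand-in data).
import math
import torch

def gaussian_pdf(x, mu, sigma):
    """PDF of N(mu, sigma) evaluated elementwise at x."""
    return torch.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

torch.manual_seed(0)
q = 0.3 * torch.randn(2, 6, 197, 64) + 0.1       # stand-in query tensor [B, heads, tokens, dim]

mu, sigma = q.mean().item(), q.std().item()      # parameters of N(mu(q), sigma(q))
lo, hi, bins = q.min().item(), q.max().item(), 50
hist = torch.histc(q, bins=bins, min=lo, max=hi)
density = hist / (hist.sum() * (hi - lo) / bins)           # normalized histogram (empirical PDF)
edges = torch.linspace(lo, hi, bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])
fit = gaussian_pdf(centers, mu, sigma)

# If the Gaussian assumption of Eq. (2.18) holds, the two curves nearly coincide.
gap = (density - fit).abs().max().item()
print(f"mu={mu:.3f}, sigma={sigma:.3f}, max |density - fit| = {gap:.3f}")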
Since weights and activations with a highly compressed bit width in fully quantized ViT have limited capabilities, the ideal quantization process should preserve the corresponding